Advance Analytics with R (UG 21-24)
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics.
I am an RStudio (Posit) certified tidyverse instructor.
I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
Dip our toes into classification techniques. How to apply and assess these methods.
References for this lecture:
“….often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.”
Default data

| default | student | balance | income |
|---|---|---|---|
| No | No | 729.5265 | 44361.625 |
| No | Yes | 817.1804 | 12106.135 |
| No | No | 1073.5492 | 31767.139 |
| No | No | 529.2506 | 35704.494 |
| No | No | 785.6559 | 38463.496 |
| No | Yes | 919.5885 | 7491.559 |
| No | No | 825.5133 | 24905.227 |
| No | Yes | 808.6675 | 17600.451 |
| No | No | 1161.0579 | 37468.529 |
| No | No | 0.0000 | 29275.268 |
| No | Yes | 0.0000 | 21871.073 |
| No | Yes | 1220.5838 | 13268.562 |
| No | No | 237.0451 | 28251.695 |
| No | No | 606.7423 | 44994.556 |
| No | No | 1112.9684 | 23810.174 |
| No | No | 286.2326 | 45042.413 |
| No | No | 0.0000 | 50265.312 |
| No | Yes | 527.5402 | 17636.540 |
| No | No | 485.9369 | 61566.106 |
| No | No | 1095.0727 | 26464.631 |
Default is our response (\(Y\)): Yes or No.
I ran this: \(p(balance) = \beta_0 + \beta_1X\), where \(X\) is balance.
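## load packages (assumption: Default comes from ISLR2; ISLR ships it too)
library(dplyr)
library(ggplot2)
library(ISLR2)
## make a dummy for default: 1 if "Yes", 0 otherwise
Default|>
mutate(
default_dumm = ifelse(
default == "Yes",
1,0
)
)-> def_dum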
## regress dummy over balance and plot
lm(default_dumm ~ balance,
data = def_dum)|>
broom::augment()|>
ggplot(aes(balance,default_dumm))+
geom_point(alpha= 0.6)+
geom_line(aes(balance, .fitted),
colour = "red")+
labs(
title = "Linear regression fit to qualitative response",
subtitle = "Yes =1, No = 0",
y = "prob default status"
)+
theme_minimal() -> plot_linear
## Run the logistic regression
glm(
default_dumm ~ balance,
data = def_dum,
family = binomial
)|>
broom::augment(type.predict = "response")|>
ggplot(aes(balance,default_dumm))+
geom_point(alpha= 0.6)+
geom_line(aes(balance, .fitted),
colour = "red")+
labs(
title = "Logistic regression fit to qualitative response",
subtitle = "Yes =1, No = 0",
y = "prob default status"
)+
theme_minimal() -> logistic_plot
We saw that some fitted values in the linear model were negative.
We need a function that returns values in [0,1].
\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}\]
This is the logistic function. Its coefficients are estimated by maximum likelihood.
odds:
\[\frac{p(X)}{1-p(X)}\]
log odds or logit:
\[log(\frac{p(X)}{1-p(X)}) = \beta_0 + \beta_1X\]
If the following are the results of the model \(logit(p(default)) = \beta_0 + \beta_1Balance\):
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -10.651330614 | 0.3611573721 | -29.49221 | 3.623124e-191 |
| balance | 0.005498917 | 0.0002203702 | 24.95309 | 1.976602e-137 |
What is the probability of default with a balance of $5000?
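One way to answer this is to plug the estimates from the table into the logistic function. A minimal sketch in base R, with the coefficients copied from the table above; the second balance of 1000 is an extra illustrative value, not part of the question:

```r
# estimates copied from the coefficient table above
b0 <- -10.651330614   # (Intercept)
b1 <- 0.005498917     # balance

# log odds are linear in balance
log_odds_5000 <- b0 + b1 * 5000

# plogis(z) computes exp(z) / (1 + exp(z)), i.e. the logistic function
p_5000 <- plogis(log_odds_5000)

# illustrative comparison with a much smaller balance
p_1000 <- plogis(b0 + b1 * 1000)

c(log_odds_5000 = log_odds_5000, p_5000 = p_5000, p_1000 = p_1000)
```

The log odds at a balance of 5000 are about 16.8, so the predicted probability of default is essentially 1, while at a balance of 1000 it is well below 1%.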
\[p(X) = \frac{e^{(\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_nX_n)}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_nX_n}}\]
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.086905e+01 | 4.922555e-01 | -22.080088 | 4.911280e-108 |
| income | 3.033450e-06 | 8.202615e-06 | 0.369815 | 7.115203e-01 |
| balance | 5.736505e-03 | 2.318945e-04 | 24.737563 | 4.219578e-135 |
| studentYes | -6.467758e-01 | 2.362525e-01 | -2.737646 | 6.188063e-03 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.5041278 | 0.07071301 | -49.554219 | 0.0000000000 |
| studentYes | 0.4048871 | 0.11501883 | 3.520181 | 0.0004312529 |
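The two tables above appear to come from a model with income, balance and student status, and from a model with student status alone. Note that the sign of studentYes flips between them. A minimal sketch of what the multiple-regression coefficients imply, using a hypothetical customer with a balance of 1500 and an income of 40000 (these two values are assumptions chosen for illustration):

```r
# coefficients copied from the multiple logistic regression table above
b <- c(intercept  = -1.086905e+01,
       income     =  3.033450e-06,
       balance    =  5.736505e-03,
       studentYes = -6.467758e-01)

# hypothetical profile: same balance and income, only student status differs
balance <- 1500
income  <- 40000

# linear predictor (log odds); student is 1 for students, 0 otherwise
log_odds <- function(student) {
  b[["intercept"]] + b[["income"]] * income +
    b[["balance"]] * balance + b[["studentYes"]] * student
}

p_student     <- plogis(log_odds(1))
p_non_student <- plogis(log_odds(0))

c(student = p_student, non_student = p_non_student)
```

With balance and income held fixed, the student has the lower predicted probability of default (roughly 0.06 against 0.10), even though the student-only model gives students the higher probability.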
There is no consensus in the statistics community on a single measure of goodness of fit for logistic regression.
Use the Credit data in {ISLR}.
What you just did is called a stratified binary model.
to Multinomial Logistic Regression
\[Pr(Y=k|X=x) = \frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}\]
for \(k = 1, \dots, K-1\), and
\[Pr(Y=K|X=x) = \frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}\]
\[log(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}) = \beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p\]
Which class is treated as reference or baseline is unimportant.
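A small numeric check of this claim, as a sketch in base R. The class names A, B, C and the linear-predictor values are made up for illustration; `probs_from_eta` is a hypothetical helper, not a function from any package:

```r
# log odds of classes A and B against baseline C for one observation (made-up values)
eta_vs_C <- c(A = 1.2, B = -0.4)

# turn K-1 log odds against a baseline into K class probabilities
probs_from_eta <- function(eta, baseline) {
  expo <- exp(eta)
  p <- c(expo, 1) / (1 + sum(expo))
  names(p) <- c(names(eta), baseline)
  p
}

p_base_C <- probs_from_eta(eta_vs_C, baseline = "C")

# same model re-expressed with A as baseline: log(p_k / p_A) = eta_k - eta_A
eta_vs_A <- c(B = eta_vs_C[["B"]] - eta_vs_C[["A"]],
              C = 0 - eta_vs_C[["A"]])
p_base_A <- probs_from_eta(eta_vs_A, baseline = "A")

# the fitted probabilities agree once the classes are put in the same order
all.equal(p_base_C[c("A", "B", "C")], p_base_A[c("A", "B", "C")])
```

The coefficients and their interpretation change with the baseline, since each one is a log odds against the reference class, but the fitted class probabilities do not.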
How to interpret this?
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
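The model that follows uses a data frame called peng_ref, which is not defined on these slides. One plausible construction, assuming it is the penguins data from the palmerpenguins package with Gentoo set as the reference level (the coefficient table only has rows for Adelie and Chinstrap, which suggests Gentoo is the baseline):

```r
library(palmerpenguins)  # assumption: source of the penguins data

# assumption: peng_ref is penguins with Gentoo as the reference class
peng_ref <- penguins
peng_ref$species <- relevel(peng_ref$species, ref = "Gentoo")

# the first level is the one multinom() treats as the baseline
levels(peng_ref$species)
```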
multi_log <- nnet::multinom(
formula = species ~ body_mass_g + bill_length_mm + bill_depth_mm + flipper_length_mm + sex + island,
data = peng_ref
)
# weights: 27 (16 variable)
initial value 365.837892
iter 10 value 21.914358
iter 20 value 1.629266
iter 30 value 0.026372
final value 0.000049
converged
Call:
nnet::multinom(formula = species ~ body_mass_g + bill_length_mm +
bill_depth_mm + flipper_length_mm + sex + island, data = peng_ref)
Coefficients:
(Intercept) body_mass_g bill_length_mm bill_depth_mm
Adelie 502.6573 -0.08755830 -20.075027 34.82987
Chinstrap -434.3867 -0.02106537 6.332771 -16.48865
flipper_length_mm sexmale islandDream islandTorgersen
Adelie 0.5054518 33.23469 62.03886 144.9809
Chinstrap 1.7645190 -55.22699 335.85058 63.1425
Std. Errors:
(Intercept) body_mass_g bill_length_mm bill_depth_mm
Adelie 0.5314853 2.351402 29.93540 5.286822
Chinstrap 0.5310960 4.080649 29.91681 5.278463
flipper_length_mm sexmale islandDream islandTorgersen
Adelie 49.88305 0.2294146 0.531096 4.701009e-47
Chinstrap 49.81079 0.2290253 0.531096 4.261135e-130
Residual Deviance: 9.874339e-05
AIC: 32.0001
# calculate z-statistics of coefficients
z_stats <- summary(multi_log)$coefficients/
summary(multi_log)$standard.errors
# convert to p-values
p_values <- (1 - pnorm(abs(z_stats)))*2
# display p-values in transposed data frame
data.frame(t(p_values))
Adelie Chinstrap
(Intercept) 0.000000e+00 0.000000000
body_mass_g 9.702963e-01 0.995881131
bill_length_mm 5.024680e-01 0.832357200
bill_depth_mm 4.456258e-11 0.001785562
flipper_length_mm 9.919154e-01 0.971741303
sexmale 0.000000e+00 0.000000000
islandDream 0.000000e+00 0.000000000
islandTorgersen 0.000000e+00 0.000000000
Gentoo Adelie Chinstrap
1 1.565008e-135 1.000000e+00 1.009721e-242
2 3.833780e-97 1.000000e+00 1.450741e-166
3 3.913549e-122 1.000000e+00 1.006490e-181
5 3.854489e-165 1.000000e+00 2.652195e-247
6 2.628864e-168 1.000000e+00 9.671388e-281
7 5.558841e-114 1.000000e+00 3.782674e-190
8 3.576898e-116 1.000000e+00 2.880335e-227
13 3.717985e-108 1.000000e+00 3.470609e-172
14 5.313520e-178 1.000000e+00 2.906555e-297
15 4.397228e-190 1.000000e+00 9.358591e-320
16 4.638659e-132 1.000000e+00 3.570151e-212
17 1.323762e-143 1.000000e+00 1.383387e-216
18 3.906700e-111 1.000000e+00 6.765722e-218
19 2.342237e-174 1.000000e+00 3.742224e-262
20 1.803219e-103 1.000000e+00 6.877397e-206
21 3.440325e-75 1.000000e+00 1.084873e-188
22 2.942167e-90 1.000000e+00 4.087688e-228
23 1.885524e-93 1.000000e+00 8.693837e-211
24 1.304105e-64 1.000000e+00 3.625997e-196
25 2.258588e-50 1.000000e+00 2.714209e-176
26 1.050278e-93 1.000000e+00 4.472482e-212
27 5.070290e-66 1.000000e+00 1.977702e-192
28 4.657299e-56 1.000000e+00 1.773975e-147
29 6.353727e-88 1.000000e+00 1.524263e-202
30 1.459617e-55 1.000000e+00 2.364103e-190
31 1.084357e-69 1.000000e+00 9.173878e-17
32 1.227459e-100 1.000000e+00 5.420913e-94
33 1.265506e-86 1.000000e+00 2.282452e-34
34 8.494378e-82 1.000000e+00 4.163335e-66
35 3.892077e-102 1.000000e+00 1.530044e-47
36 5.034347e-123 1.000000e+00 7.437468e-121
37 3.676963e-116 1.000000e+00 5.542033e-110
38 2.071518e-62 1.000000e+00 3.696922e-16
39 2.421798e-124 1.000000e+00 2.037458e-93
40 6.809138e-66 1.000000e+00 1.601284e-61
41 3.425600e-120 1.000000e+00 7.624079e-81
42 1.606314e-77 1.000000e+00 4.276671e-50
43 6.806524e-135 1.000000e+00 5.590874e-97
44 1.276361e-49 1.000000e+00 3.090448e-26
45 1.481080e-105 1.000000e+00 2.763406e-53
46 2.564070e-66 1.000000e+00 2.715698e-55
47 3.442093e-99 1.000000e+00 7.482183e-81
49 2.187000e-113 1.000000e+00 2.595632e-71
50 2.061308e-96 1.000000e+00 2.896132e-90
51 2.979432e-49 1.000000e+00 3.169427e-145
52 1.697699e-47 1.000000e+00 1.852417e-180
53 3.668664e-95 1.000000e+00 1.072902e-201
54 3.787966e-52 1.000000e+00 1.067250e-172
55 8.398157e-123 1.000000e+00 2.068522e-229
56 4.242336e-55 1.000000e+00 1.503984e-174
57 1.477876e-49 1.000000e+00 3.319481e-146
58 9.797629e-62 1.000000e+00 3.358364e-184
59 2.927841e-83 1.000000e+00 9.109545e-178
60 1.502988e-94 1.000000e+00 3.440099e-226
61 3.045152e-84 1.000000e+00 8.887102e-183
62 4.785268e-68 1.000000e+00 5.169778e-209
63 4.444320e-52 1.000000e+00 3.202953e-150
64 1.419865e-38 1.000000e+00 2.020648e-158
65 2.360178e-92 1.000000e+00 2.038745e-188
66 5.421996e-35 1.000000e+00 4.069958e-151
67 5.476171e-70 1.000000e+00 3.160205e-158
68 2.078893e-49 1.000000e+00 3.188062e-179
69 7.954905e-146 1.000000e+00 1.710102e-209
70 1.077474e-99 1.000000e+00 7.562552e-198
71 3.865774e-182 1.000000e+00 1.261787e-274
72 4.918193e-122 1.000000e+00 6.676413e-220
73 6.012949e-105 1.000000e+00 1.034230e-162
74 1.910730e-68 1.000000e+00 4.867732e-150
75 3.284153e-138 1.000000e+00 2.276590e-215
76 2.618102e-84 1.000000e+00 9.775178e-174
77 9.229092e-81 1.000000e+00 2.732396e-137
78 1.219806e-157 1.000000e+00 3.841097e-274
79 5.636064e-116 1.000000e+00 4.127281e-182
80 5.403569e-109 1.000000e+00 2.345329e-202
81 2.595345e-160 1.000000e+00 5.445574e-234
82 6.249382e-53 1.000000e+00 5.461387e-139
83 5.962076e-143 1.000000e+00 2.474997e-229
84 1.620648e-166 1.000000e+00 1.214968e-284
85 1.460100e-104 1.000000e+00 1.627128e-56
86 5.435650e-115 1.000000e+00 2.320514e-97
87 4.252100e-136 1.000000e+00 7.652729e-132
88 5.233585e-114 1.000000e+00 1.076593e-75
89 3.364375e-108 1.000000e+00 1.960486e-98
90 5.170773e-96 1.000000e+00 8.844224e-54
91 2.399353e-116 1.000000e+00 1.564235e-67
92 2.369104e-57 1.000000e+00 5.984160e-23
93 1.586023e-119 1.000000e+00 1.344923e-80
94 1.484469e-60 1.000000e+00 3.282288e-46
95 1.300253e-107 1.000000e+00 1.283398e-61
96 9.985286e-73 1.000000e+00 1.402530e-42
97 3.686971e-96 1.000000e+00 1.308798e-55
98 1.683356e-66 1.000000e+00 1.620603e-44
99 1.008476e-129 1.000000e+00 6.726151e-87
100 7.608280e-50 1.000000e+00 1.154731e-20
101 3.825182e-85 1.000000e+00 1.162753e-192
102 2.022024e-43 1.000000e+00 3.536458e-174
103 1.322537e-55 1.000000e+00 4.846989e-143
104 1.577519e-86 1.000000e+00 1.054969e-231
105 4.338556e-101 1.000000e+00 1.474158e-197
106 1.261728e-78 1.000000e+00 6.837177e-209
107 9.369108e-44 1.000000e+00 3.189705e-131
108 2.378118e-96 1.000000e+00 3.188707e-237
109 5.296813e-63 1.000000e+00 6.022331e-159
110 6.780156e-06 9.999932e-01 1.699143e-128
111 1.872526e-34 1.000000e+00 9.761655e-120
112 5.672252e-10 1.000000e+00 2.800928e-138
113 2.520868e-61 1.000000e+00 6.490000e-149
114 3.439657e-41 1.000000e+00 1.510222e-165
115 1.613357e-80 1.000000e+00 8.393870e-198
116 4.589386e-26 1.000000e+00 2.167375e-139
117 1.333710e-133 1.000000e+00 7.219327e-193
118 1.875383e-181 1.000000e+00 6.421625e-293
119 5.416429e-142 1.000000e+00 1.382857e-212
120 1.686160e-134 1.000000e+00 1.870521e-225
121 7.970359e-148 1.000000e+00 3.536496e-218
122 1.292156e-177 1.000000e+00 3.222353e-281
123 4.198999e-96 1.000000e+00 3.384342e-165
124 2.604573e-112 1.000000e+00 8.559222e-197
125 1.837631e-140 1.000000e+00 4.157304e-204
126 1.947839e-121 1.000000e+00 3.828437e-215
127 2.478227e-127 1.000000e+00 1.775509e-191
128 1.024613e-90 1.000000e+00 9.595811e-183
129 1.396937e-126 1.000000e+00 1.546359e-184
130 3.280699e-78 1.000000e+00 1.061450e-146
131 2.298752e-132 1.000000e+00 1.046066e-200
132 3.065056e-121 1.000000e+00 1.841132e-206
133 3.030640e-114 1.000000e+00 2.000481e-72
134 8.112726e-87 1.000000e+00 2.224610e-71
135 7.842629e-91 1.000000e+00 6.644925e-43
136 3.409778e-60 1.000000e+00 2.490694e-29
137 1.683988e-121 1.000000e+00 2.224177e-74
138 1.033267e-106 1.000000e+00 5.770265e-90
139 2.701357e-84 1.000000e+00 8.079601e-33
140 8.450468e-66 1.000000e+00 1.488202e-42
141 3.152874e-67 9.999593e-01 4.068580e-05
142 1.617609e-75 1.000000e+00 2.721354e-42
143 7.409834e-126 1.000000e+00 3.397055e-75
144 8.994604e-63 1.000000e+00 7.923681e-28
145 5.783065e-103 1.000000e+00 8.678595e-44
146 4.605532e-105 1.000000e+00 4.108604e-90
147 4.342044e-80 1.000000e+00 1.572962e-65
148 1.885663e-113 1.000000e+00 3.916754e-78
149 5.687431e-113 1.000000e+00 2.382362e-66
150 2.105977e-104 1.000000e+00 3.060488e-83
151 4.036940e-91 1.000000e+00 6.654021e-48
152 1.912664e-70 1.000000e+00 3.972737e-38
153 1.000000e+00 1.772889e-109 1.371973e-36
154 1.000000e+00 1.293197e-123 1.826463e-68
155 1.000000e+00 7.520028e-117 3.422658e-36
156 1.000000e+00 6.893219e-143 8.765637e-70
157 1.000000e+00 8.385801e-122 6.314732e-71
158 1.000000e+00 1.507993e-110 7.335537e-39
159 1.000000e+00 1.320615e-93 2.768662e-51
160 1.000000e+00 2.262479e-93 3.099923e-74
161 1.000000e+00 1.120134e-78 2.435404e-46
162 1.000000e+00 1.043816e-91 2.769695e-77
163 1.000000e+00 1.266441e-61 1.520768e-53
164 1.000000e+00 2.726199e-115 3.866374e-79
165 1.000000e+00 9.945072e-102 6.813724e-41
166 1.000000e+00 8.140272e-145 4.315103e-75
167 1.000000e+00 1.696385e-74 1.841659e-45
168 1.000000e+00 3.811851e-135 1.988430e-77
169 1.000000e+00 4.187589e-56 1.407891e-47
170 1.000000e+00 4.529948e-158 3.567558e-75
171 1.000000e+00 1.564320e-102 6.698092e-50
172 1.000000e+00 7.030489e-119 2.242786e-66
173 1.000000e+00 3.026554e-158 8.663190e-63
174 1.000000e+00 3.137029e-99 3.705486e-50
175 1.000000e+00 4.648131e-89 2.375477e-42
176 1.000000e+00 1.702306e-77 1.311517e-80
177 1.000000e+00 3.163518e-101 3.495546e-46
178 1.000000e+00 3.054452e-88 1.327207e-76
180 1.000000e+00 1.724005e-125 3.039510e-76
181 1.000000e+00 3.604353e-115 2.263329e-40
182 1.000000e+00 3.118941e-135 1.353999e-67
183 1.000000e+00 7.598731e-100 9.617386e-71
184 1.000000e+00 1.264204e-73 3.452318e-56
185 1.000000e+00 6.903902e-103 9.568626e-57
186 1.000000e+00 4.939525e-210 2.816545e-50
187 1.000000e+00 3.582723e-134 7.603848e-39
188 1.000000e+00 1.878878e-100 8.760223e-78
189 1.000000e+00 4.506500e-88 2.221559e-52
190 1.000000e+00 5.736024e-45 2.434454e-95
191 1.000000e+00 4.502070e-80 3.721294e-46
192 1.000000e+00 7.073412e-113 2.117083e-81
193 1.000000e+00 5.134575e-52 8.682756e-47
194 1.000000e+00 9.195023e-126 3.007258e-71
195 1.000000e+00 1.487214e-87 2.630540e-41
196 1.000000e+00 9.689026e-107 2.712435e-62
197 1.000000e+00 4.463268e-130 5.531538e-69
198 1.000000e+00 5.495591e-91 1.539825e-47
199 1.000000e+00 1.805392e-82 2.836442e-41
200 1.000000e+00 1.028279e-123 2.594745e-65
201 1.000000e+00 7.028607e-120 1.460173e-44
202 1.000000e+00 2.064603e-77 6.387252e-86
203 1.000000e+00 3.070581e-112 2.416780e-46
204 1.000000e+00 8.443716e-131 7.699229e-61
205 1.000000e+00 5.034473e-79 8.759494e-48
206 1.000000e+00 1.247832e-118 2.619672e-56
207 1.000000e+00 1.046291e-108 3.826834e-43
208 1.000000e+00 4.093166e-71 1.731317e-77
209 1.000000e+00 6.860213e-72 2.136910e-48
210 1.000000e+00 1.269283e-79 8.616304e-73
211 1.000000e+00 2.750180e-63 1.025108e-55
212 1.000000e+00 7.667920e-138 1.981604e-63
213 1.000000e+00 1.118393e-82 1.219461e-42
214 1.000000e+00 1.993765e-98 3.966097e-72
215 1.000000e+00 6.105472e-91 1.731410e-39
216 1.000000e+00 4.647250e-168 4.056522e-51
217 1.000000e+00 1.385310e-97 2.832574e-40
218 1.000000e+00 2.621620e-114 1.352348e-72
220 1.000000e+00 8.632477e-125 8.344145e-71
221 1.000000e+00 2.591495e-77 7.813423e-46
222 1.000000e+00 3.248586e-145 3.191987e-61
223 1.000000e+00 1.311400e-104 1.558088e-43
224 1.000000e+00 3.566976e-78 7.591760e-74
225 1.000000e+00 1.138779e-97 8.241421e-70
226 1.000000e+00 4.596155e-114 9.416621e-49
227 1.000000e+00 2.254524e-91 1.187537e-46
228 1.000000e+00 9.484012e-120 4.411945e-71
229 1.000000e+00 8.464504e-111 2.395072e-42
230 1.000000e+00 8.286447e-147 7.570869e-76
231 1.000000e+00 3.488978e-101 1.392094e-42
232 1.000000e+00 2.690618e-91 4.928889e-90
233 1.000000e+00 1.674415e-120 5.032080e-38
234 1.000000e+00 1.810814e-148 3.469254e-61
235 1.000000e+00 5.692476e-108 2.484898e-44
236 1.000000e+00 1.130152e-117 5.371048e-67
237 1.000000e+00 3.160061e-99 1.046213e-45
238 1.000000e+00 4.235316e-112 4.820883e-74
239 1.000000e+00 4.723465e-70 3.696940e-48
240 1.000000e+00 3.876329e-154 2.180304e-55
241 1.000000e+00 1.269662e-123 3.931959e-41
242 1.000000e+00 1.245313e-125 2.493731e-66
243 1.000000e+00 4.957267e-110 2.215446e-44
244 1.000000e+00 1.002205e-119 6.243525e-67
245 1.000000e+00 7.196249e-94 4.540894e-49
246 1.000000e+00 1.071104e-121 1.507151e-72
247 1.000000e+00 1.726970e-85 1.237235e-52
248 1.000000e+00 1.570425e-121 1.850985e-60
249 1.000000e+00 1.501573e-99 3.577315e-70
250 1.000000e+00 4.034356e-107 2.046856e-39
251 1.000000e+00 6.894741e-118 3.942318e-46
252 1.000000e+00 3.637413e-114 1.380370e-66
253 1.000000e+00 9.974536e-115 5.983107e-40
254 1.000000e+00 4.214496e-161 7.208930e-58
255 1.000000e+00 1.839915e-101 2.583646e-51
256 1.000000e+00 2.884899e-128 2.469645e-61
258 1.000000e+00 1.986041e-94 1.689584e-85
259 1.000000e+00 2.984444e-56 4.996325e-62
260 1.000000e+00 1.247952e-155 3.919568e-62
261 1.000000e+00 1.782553e-76 5.280700e-53
262 1.000000e+00 3.314837e-122 2.323582e-79
263 1.000000e+00 1.677925e-135 1.493023e-39
264 1.000000e+00 1.198820e-137 3.330126e-69
265 1.000000e+00 8.029180e-62 6.684889e-58
266 1.000000e+00 4.356844e-129 1.647240e-62
267 1.000000e+00 1.150705e-90 5.117224e-37
268 1.000000e+00 2.544552e-178 1.158961e-53
270 1.000000e+00 7.893047e-128 6.341904e-80
271 1.000000e+00 5.236358e-127 9.840445e-39
273 1.000000e+00 2.258098e-111 1.118938e-42
274 1.000000e+00 7.780342e-140 1.175677e-69
275 1.000000e+00 7.922007e-104 3.689074e-56
276 1.000000e+00 4.308239e-118 1.367370e-77
277 9.383743e-73 4.229444e-53 1.000000e+00
278 2.414257e-46 6.692323e-33 1.000000e+00
279 4.687592e-52 1.229642e-45 1.000000e+00
280 1.048175e-60 3.444186e-21 1.000000e+00
281 5.469722e-54 1.129503e-52 1.000000e+00
282 2.241766e-70 1.074536e-56 1.000000e+00
283 4.593230e-61 5.951872e-27 1.000000e+00
284 2.288727e-61 5.338912e-73 1.000000e+00
285 1.432389e-60 1.726689e-45 1.000000e+00
286 2.039038e-50 3.258510e-34 1.000000e+00
287 9.109742e-72 1.098026e-65 1.000000e+00
288 6.685114e-45 7.275545e-30 1.000000e+00
289 3.123457e-71 3.729335e-74 1.000000e+00
290 2.497980e-64 4.169730e-94 1.000000e+00
291 1.295953e-74 4.031413e-65 1.000000e+00
292 1.838294e-49 1.795872e-43 1.000000e+00
293 7.633018e-50 2.033881e-08 1.000000e+00
294 7.713006e-95 5.573962e-187 1.000000e+00
295 2.164057e-66 8.161922e-34 1.000000e+00
296 4.115408e-48 1.364831e-66 1.000000e+00
297 1.978714e-56 2.528745e-16 1.000000e+00
298 2.777610e-57 4.234493e-43 1.000000e+00
299 1.206631e-74 3.632233e-24 1.000000e+00
300 2.515493e-47 1.752889e-37 1.000000e+00
301 1.966260e-77 2.935052e-51 1.000000e+00
302 6.646101e-54 9.510346e-75 1.000000e+00
303 3.208245e-87 2.559884e-89 1.000000e+00
304 1.575148e-52 1.308607e-37 1.000000e+00
305 1.340750e-70 2.068844e-59 1.000000e+00
306 2.051777e-51 1.462469e-77 1.000000e+00
307 1.418427e-65 1.883947e-06 9.999981e-01
308 9.302754e-49 2.214646e-66 1.000000e+00
309 6.913968e-68 6.639889e-27 1.000000e+00
310 1.217067e-57 1.420780e-69 1.000000e+00
311 6.092297e-54 2.616066e-40 1.000000e+00
312 4.367672e-85 1.831430e-104 1.000000e+00
313 5.181682e-72 1.506746e-68 1.000000e+00
314 9.562039e-46 9.723739e-63 1.000000e+00
315 1.754656e-90 1.469712e-63 1.000000e+00
316 1.634584e-54 2.249546e-86 1.000000e+00
317 7.277925e-54 1.567397e-30 1.000000e+00
318 1.370821e-69 3.584048e-60 1.000000e+00
319 6.936680e-55 4.964565e-42 1.000000e+00
320 1.631335e-79 7.066499e-64 1.000000e+00
321 2.551698e-86 8.374298e-111 1.000000e+00
322 1.666233e-54 5.578999e-83 1.000000e+00
323 4.887942e-82 2.089817e-90 1.000000e+00
324 1.767910e-51 1.671757e-39 1.000000e+00
325 3.012468e-55 3.049238e-44 1.000000e+00
326 4.008165e-89 1.181851e-112 1.000000e+00
327 7.332885e-95 1.178048e-103 1.000000e+00
328 3.782168e-57 2.803699e-64 1.000000e+00
329 1.058241e-74 9.870156e-61 1.000000e+00
330 7.903304e-51 1.246369e-44 1.000000e+00
331 1.368628e-63 1.565206e-13 1.000000e+00
332 2.730994e-62 2.762055e-61 1.000000e+00
333 5.220568e-80 2.129608e-59 1.000000e+00
334 1.514945e-45 4.068005e-24 1.000000e+00
335 2.029049e-57 3.448219e-51 1.000000e+00
336 7.675421e-61 3.661378e-11 1.000000e+00
337 8.942477e-59 1.327385e-61 1.000000e+00
338 6.212078e-80 1.959542e-90 1.000000e+00
339 6.325469e-78 5.897002e-70 1.000000e+00
340 1.160711e-67 1.231273e-101 1.000000e+00
341 1.194972e-71 8.123202e-17 1.000000e+00
342 2.133439e-53 4.893635e-52 1.000000e+00
343 5.050237e-61 1.191538e-66 1.000000e+00
344 2.773257e-79 6.303744e-89 1.000000e+00
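The table above looks like the fitted class probabilities for each observation. A minimal sketch of turning such probabilities into a predicted class, using three rows copied from the table (observations 1, 153 and 277) so that it runs without the model object:

```r
# three rows copied from the probability table above
probs <- rbind(
  "1"   = c(Gentoo = 1.565008e-135, Adelie = 1.000000e+00,  Chinstrap = 1.009721e-242),
  "153" = c(Gentoo = 1.000000e+00,  Adelie = 1.772889e-109, Chinstrap = 1.371973e-36),
  "277" = c(Gentoo = 9.383743e-73,  Adelie = 4.229444e-53,  Chinstrap = 1.000000e+00)
)

# assign each observation to the class with the largest probability
predicted <- colnames(probs)[max.col(probs, ties.method = "first")]
names(predicted) <- rownames(probs)
predicted
```

Almost every probability in the table is numerically 0 or 1, which is worth keeping in mind when reading the coefficient and standard-error output for this model.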
Use the Publication data from ISLR2.
Split data into 80%-20% training and test set randomly.
Fit a multinomial logistic model to classify the variable mech.
Use the test data to predict mech. See if it is a reasonable fit.
What test error rate did you get in the previous exercise?
\[Ave(I(y_0 \neq \hat y_0))\]
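The quantity above is just the share of test observations that are misclassified. A minimal sketch of the whole workflow (random 80/20 split, multinomial fit, test error rate), run on the built-in iris data rather than the exercise data so that the exercise is left open; the seed and the choice of iris are assumptions made for illustration:

```r
library(nnet)  # provides multinom()

set.seed(123)  # arbitrary seed so the split is reproducible

# random 80/20 split of the rows
n <- nrow(iris)
train_idx <- sample(n, size = round(0.8 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# multinomial logistic regression on the training set only
fit <- multinom(Species ~ ., data = train, trace = FALSE)

# predicted classes for the held-out observations
pred <- predict(fit, newdata = test)

# test error rate: average of the indicator I(y0 != y0_hat)
test_error <- mean(pred != test$Species)
test_error
```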
“a classifier that assigns each observation to the most likely class, given its predictor values” minimizes the test error rate.
This lowest possible error rate is called the Bayes error rate.
Bayes Decision Boundary
Why not always use Bayes Classifier?
Keep in mind the good old Bayes Rule
\[P(A|B) = \frac{P(B|A)* P(A)}{P(B)}\]
\(\pi_k\) is the overall (prior) probability that a randomly chosen observation belongs to the \(k^{th}\) class of the response.
\(f_k(X) = Pr(X|Y=k)\)
\[Pr(Y=k|X=x) = \frac{\pi_k*f_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}\]
We are trying to approximate the Bayes classifier! We will explore linear discriminant analysis, quadratic discriminant analysis and naive Bayes.
The overarching goal is to estimate \(f_k(x)\).
To achieve our goal, we assume that \(f_k(x)\) is normal.
\[f_k(x) = \frac{1}{\sigma_k\sqrt{2\pi}}exp(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)\]
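Here, \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters of the \(k^{th}\) class.
We also assume that \(\sigma_1^2 = \dots = \sigma_K^2 = \sigma^2\).
\[ Pr(Y=k|X=x) = \frac{\pi_k*\frac{1}{\sigma\sqrt{2\pi}}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sigma\sqrt{2\pi}}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)} \]
Taking logs and dropping the terms that do not depend on \(k\), we assign \(x\) to the class with the largest discriminant score:
\[ \delta_k(x) = x.\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2} + log(\pi_k) \]
For \(K = 2\) and \(\pi_1 = \pi_2\), the decision boundary is the point where \(\delta_1(x) = \delta_2(x)\):
\[ x = \frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}= \frac{\mu_1 + \mu_2}{2} \]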
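A small numeric check of the expressions above, as a sketch in base R for two classes and one predictor. The means, the shared standard deviation and the equal priors are made-up values, and `posterior` and `delta` are hypothetical helper names:

```r
# illustrative parameters: two classes, shared variance, equal priors
mu    <- c(-1.25, 1.25)
sigma <- 1
prior <- c(0.5, 0.5)

# posterior from Bayes' theorem with normal densities f_k(x)
posterior <- function(x) {
  num <- prior * dnorm(x, mean = mu, sd = sigma)
  num / sum(num)
}

# linear score: x * mu_k / sigma^2 - mu_k^2 / (2 * sigma^2) + log(pi_k)
delta <- function(x) {
  x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(prior)
}

# with equal priors the boundary is the midpoint of the two means
boundary <- (mu[1] + mu[2]) / 2

posterior(boundary)   # both classes equally likely at the boundary
delta(boundary)       # both scores equal at the boundary

# away from the boundary, the largest score picks the most probable class
x0 <- 0.3
which.max(posterior(x0)) == which.max(delta(x0))
```

In practice \(\mu_k\), \(\sigma^2\) and \(\pi_k\) are not known and are replaced by estimates from the training data.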
lda_default_balance_student <-
MASS::lda(default ~ balance + student, data = Default)
lda_default_balance_student
Call:
lda(default ~ balance + student, data = Default)
Prior probabilities of groups:
No Yes
0.9667 0.0333
Group means:
balance studentYes
No 803.9438 0.2914037
Yes 1747.8217 0.3813814
Coefficients of linear discriminants:
LD1
balance 0.002244397
studentYes -0.249059498
training error rate
trivial null classifier
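A minimal sketch of both ideas on the Default data: the training error rate of the LDA fit, and the error rate of a classifier that always predicts "No". It assumes Default is loaded from ISLR2 (ISLR ships the same data) and refits the model so that the block runs on its own:

```r
library(MASS)   # lda()
library(ISLR2)  # assumption: source of the Default data

lda_fit <- lda(default ~ balance + student, data = Default)

# predicted classes for the same observations used to fit the model
pred <- predict(lda_fit)$class

# confusion matrix: rows are predictions, columns are true classes
conf_mat <- table(predicted = pred, actual = Default$default)
conf_mat

# training error rate: share of training observations misclassified
train_error <- mean(pred != Default$default)

# trivial null classifier: always predict "No"
null_error <- mean(Default$default != "No")

c(train_error = train_error, null_error = null_error)
```

The null classifier is wrong only for the 3.33% of customers who default, so an overall error rate has to be judged against that baseline; the confusion matrix shows where the remaining errors fall.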
See the OJ data set in ISLR2
Use this data set to predict the variable Purchase.
Split data into 80/20 training and testing.
Use the training data to develop an LDA model. Use the ROC curve and confusion matrix to gauge model effectiveness. Fine-tune the model. See Chapter 9 of TMWR.
Predict the test data with the fine-tuned model.
Quadratic Discriminant Analysis
See the Smarket data in ISLR2.
Split in 80/20 training and testing.
Train LDA and QDA models.
Test these models and compare results - use test error rate.
What happens if you take n number of training data sets and n number of testing data sets, run LDA and QDA on each pair and plot training error rate and testing error rate distributions?
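One way to set up this experiment, sketched on the built-in iris data instead of Smarket so that the exercise is left open. The number of repetitions, the seed and the helper names `one_split` and `err` are assumptions made for illustration:

```r
library(MASS)  # lda() and qda()

set.seed(42)   # arbitrary seed for reproducibility
n_rep <- 50    # number of random train/test pairs

# one random 80/20 split, returning the four error rates
one_split <- function() {
  idx   <- sample(nrow(iris), size = 0.8 * nrow(iris))
  train <- iris[idx, ]
  test  <- iris[-idx, ]

  # misclassification rate of a fitted model on a data set
  err <- function(fit, data) mean(predict(fit, data)$class != data$Species)

  lda_fit <- lda(Species ~ ., data = train)
  qda_fit <- qda(Species ~ ., data = train)

  c(lda_train = err(lda_fit, train), lda_test = err(lda_fit, test),
    qda_train = err(qda_fit, train), qda_test = err(qda_fit, test))
}

# one row per repetition, one column per error rate
errors <- t(replicate(n_rep, one_split()))

colMeans(errors)
boxplot(errors, ylab = "error rate")
```

Comparing the spread of the training and test columns shows how much a single split can mislead about which method is better.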
\[f_k(x) = f_{k1}(x_1)*f_{k2}(x_2)*...*f_{kp}(x_p)\]
\[Pr(Y=k|X=x) = \frac{\pi_k*f_{k1}(x_1)*f_{k2}(x_2)*...*f_{kp}(x_p)}{\sum_{l=1}^K \pi_l*f_{l1}(x_1)*f_{l2}(x_2)*...*f_{lp}(x_p)}\]
How is \(f_{kj}\) estimated?
Use the naiveBayes function from the e1071 package.
Use the Smarket data and compare the results with QDA.
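To make the question about \(f_{kj}\) concrete, here is a hand-rolled sketch in base R for quantitative predictors. It assumes each \(f_{kj}\) is a normal density whose mean and standard deviation are estimated within class \(k\), which is one common choice and not the only one. It uses the built-in iris data rather than Smarket so that the exercise is left open:

```r
# Gaussian naive Bayes by hand (base R only)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# pi_k: class proportions in the training data
prior <- prop.table(table(y))

# per-class, per-predictor mean and standard deviation (p x K matrices)
mu  <- sapply(classes, function(k) colMeans(X[y == k, ]))
sdv <- sapply(classes, function(k) apply(X[y == k, ], 2, sd))

# log(pi_k) + sum_j log f_kj(x_j) for every observation and class
log_post <- sapply(classes, function(k) {
  log_dens <- sapply(seq_len(ncol(X)), function(j) {
    dnorm(X[, j], mean = mu[j, k], sd = sdv[j, k], log = TRUE)
  })
  log(as.numeric(prior[k])) + rowSums(log_dens)
})

# normalise each row to get posterior probabilities
posterior <- exp(log_post - apply(log_post, 1, max))
posterior <- posterior / rowSums(posterior)

# classify to the most probable class
pred <- factor(classes[max.col(posterior)], levels = classes)
train_accuracy <- mean(pred == y)
train_accuracy
```

Working on the log scale and subtracting the row maximum before exponentiating avoids numerical underflow when many small densities are multiplied.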